Random Forest Models To Predict Aqueous Solubility

Authors

  • David S. Palmer
  • Noel M. O'Boyle
  • Robert C. Glen
  • John B. O. Mitchell
Abstract

Random Forest regression (RF), Partial-Least-Squares (PLS) regression, Support Vector Machines (SVM), and Artificial Neural Networks (ANN) were used to develop QSPR models for the prediction of aqueous solubility, based on experimental data for 988 organic molecules. The Random Forest regression model predicted aqueous solubility more accurately than those created by PLS, SVM, and ANN and offered methods for automatic descriptor selection, an assessment of descriptor importance, and an in-parallel measure of predictive ability, all of which serve to recommend its use. The prediction of log molar solubility for an external test set of 330 molecules that are solid at 25 °C gave r² = 0.89 and RMSE = 0.69 log S units. For a standard data set selected from the literature, the model performed well with respect to other documented methods. Finally, the diversity of the training and test sets is compared to the chemical space occupied by molecules in the MDL Drug Data Report, on the basis of molecular descriptors selected by the regression analysis.
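The two figures of merit quoted in the abstract can be illustrated with a minimal sketch. It assumes that r² denotes the coefficient of determination (some QSPR papers report the squared Pearson correlation instead, and the abstract does not say which) and that RMSE is the usual root-mean-square error in log S units. The function names and the sample values are hypothetical and are not taken from the paper.

```python
import math


def rmse(observed, predicted):
    """Root-mean-square error between observed and predicted values."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(observed, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot (assumed definition)."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_observed = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_observed) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed values have zero variance")
    return 1.0 - ss_res / ss_tot


# Hypothetical log S values, for illustration only (not data from the paper).
observed_log_s = [-1.0, -2.0, -3.0, -4.0]
predicted_log_s = [-1.5, -2.0, -2.5, -4.0]

print(f"RMSE = {rmse(observed_log_s, predicted_log_s):.3f} log S units")
print(f"r2   = {r_squared(observed_log_s, predicted_log_s):.3f}")
```

Because solubility is modelled on a log scale, an RMSE of 0.69 log S units corresponds to a typical error of roughly a factor of five (10^0.69 ≈ 4.9) in molar solubility.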

Similar resources

Prediction of 1-octanol solubilities using data from the Open Notebook Science Challenge

BACKGROUND 1-Octanol solubility is important in a variety of applications involving pharmacology and environmental chemistry. Current models are linear in nature and often require foreknowledge of either melting point or aqueous solubility. Here we extend the range of applicability of 1-octanol solubility models by creating a random forest model that can predict 1-octanol solubilities directly ...

Novel enhanced applications of QSPR models: Temperature dependence of aqueous solubility

A model developed to predict aqueous solubility at different temperatures has been proposed based on quantitative structure-property relationships (QSPR) methodology. The prediction consists of two steps. The first one predicts the value of k parameter in the linear equation lgSw=kT+c, where Sw is the value of solubility and T is the value of temperature. The second step uses Random Forest tech...

Correlation and Prediction of Solubility of CO2 in Amine Aqueous Solutions

The solubility of CO2 in the primary, secondary, tertiary and sterically hindered amine aqueous solutions at various conditions was studied. In the present work, the Modified Kent-Eisenberg (M-KE), the Extended Debye-Hückel (E-DH) and the Pitzer models were employed to study the solubility of CO2 in amine aqueous solutions. Two explicit equations are presented to evalu...

Application of Random Forest and Multiple Linear Regression Techniques to QSPR Prediction of an Aqueous Solubility for Military Compounds.

The relationship between the aqueous solubility of more than two thousand eight hundred organic compounds and their structures was investigated using a QSPR approach based on Simplex Representation of Molecular Structure (SiRMS). The dataset consists of 2537 diverse organic compounds. Multiple Linear Regression (MLR) and Random Forest (RF) methods were used for statistical modeling at the 2D le...

Estimating the domain of applicability for machine learning QSAR models: a study on aqueous solubility of drug discovery molecules

We investigate the use of different Machine Learning methods to construct models for aqueous solubility. Models are based on about 4000 compounds, including an in-house set of 632 drug discovery molecules of Bayer Schering Pharma. For each method, we also consider an appropriate method to obtain error bars, in order to estimate the domain of applicability (DOA) for each model. Here, we investig...

Journal:
  • Journal of Chemical Information and Modeling

Volume 47, Issue 1

Published 2007